Abstract Motivated by the imminent growth of massive, highly redundant genomic databases, we study
the problem of compressing a string database while simultaneously supporting fast random access,
substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed
how, given a straight-line program with r rules for a string s of length n, we can build an O(r)-word data
structure that allows us to extract any substring of length m in O(log n + m) time. They also showed
how, given a pattern p of length m and an edit distance k < m, their data structure supports finding
all occ approximate matches to p in s in O(r(min(mk, k^4 + m) + log n) + occ) time. Rytter (2003)
and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse
of s, and gave algorithms for building straight-line programs with O(z log n) rules. In this paper we
give a simple O(z log n)-word data structure that takes the same time for substring extraction but only
O(z min(mk, k^4 + m) + occ) time for approximate pattern matching.

Keywords Compressed pattern matching • Approximate pattern matching • LZ77 



1 Introduction 

The recent revolution in high-throughput sequencing technology has made the acquisition of large
genomic sequences drastically cheaper and faster. As the new technology takes hold, ambitious sequencing
projects such as the 1,000 Human Genomes U and the 10,000 Vertebrate Genomes [10] projects are
set to create large databases of strings (genomes) that vary only slightly from each other, and so will
contain large numbers of long repetitions. Efficient storage of these collections is not enough: fast access
to enable search and sequence alignment is paramount. The utility of such a data structure is not limited
to the treatment of DNA collections. Ferragina and Manzini's recent study of the compressibility of web
pages reveals enormous redundancy in web crawls [S]. Exploiting this redundancy to reduce space while
simultaneously enabling fast access and search over crawled pages (for snippet generation or cached
page retrieval) is a significant challenge. The problem of compressing and indexing such highly repetitive
strings (or string collections) was introduced in [5D] (see also [H]). With an LZ78- or BWT-based data
structure [U [6] we can store a string s of length n in space bounded in terms of the tth-order empirical
entropy H_t(s) [16], for any t = o(log_σ n), where σ is the alphabet size, and later extract any substring
of length m in O(m/log_σ n) time. For very repetitive texts, however, compression based on LZ77 [23]
can use significantly fewer than n H_t(s) bits [20].

Rytter [19] showed that the number z of phrases in the LZ77 parse of s is at most the number of rules
in the smallest straight-line program (SLP) for s.^1 He then showed how the LZ77 parse can be turned
into an SLP for s with O(z log n) rules whose parse tree has height O(log n). This SLP can be viewed as a
data structure that stores s in O(z log n) words and supports substring extraction in O(log n + m) time.
Bille, Landau, Raman, Rao, Sadakane and Weimann [5] showed how, given an SLP for s with r rules, we
can build a data structure that takes O(r) words and supports substring extraction in O(log n + m) time
regardless of the height of the parse tree. Unfortunately, since no polynomial-time algorithm is known
to produce an SLP for s with o(z log n) rules, even with no bound on the height, we still do not know
how to build efficiently a data structure that has better bounds than Rytter's.

Bille et al. [2] also show how, given a pattern p of length m and an edit distance k < m, their data
structure supports finding all occ approximate matches to p in s in O(r(min(mk, k^4 + m) + log n) + occ)
time. Their main idea is that, if there is a rule X → YZ in the SLP and we have already found all
the approximate matches in the expansions of Y and Z then, to find all the approximate matches in the
expansion of X, we need only search the substring consisting of the last m + k characters of Y's expansion
concatenated with the first m + k characters of Z's expansion. Extracting these characters with their data
structure takes O(log n + m) time per rule, or O(r(log n + m)) time in total. In this paper we discuss
two improvements to this idea: first, by the same argument, we need only search the m + k characters
to either side of the phrase boundaries in the LZ77 parse; second, since we know in advance where those
phrase boundaries are, we do not need the full power of random access. Our first observation immediately
improves Bille et al.'s time bound for approximate matching to O(z(min(mk, k^4 + m) + log n) + occ),
while our second has led us to develop a data structure whose time bound is O(z min(mk, k^4 + m) + occ).

Neither Rytter's nor Bille et al.'s data structures are practical. However, in another strand of recent 
work, Kreft and Navarro [12, 13] introduced a variant of LZ77 called LZ-End and gave a data structure
based on it with which we can store s in O(z' log n) + o(n) bits, where z' is the number of phrases in
the LZ-End parse of s, and later extract any phrase (not arbitrary substring) in time proportional to
its length. The o(n) term can be removed at the cost of slowing extraction down by an O(log n) factor.
Extracting arbitrary substrings is fast in practice but could be slow in the worst case. Also, although 
the LZ-End encoding is small in practice for very repetitive strings, it is not clear whether z' can be 
bounded in terms of z. 

Our Contribution. In this paper we describe a simple O(z log n)-word data structure, which we call
the block graph for s, that takes O(log n + ℓ - f) time to extract any substring s[f..ℓ] but lets us add
bookmarks to speed up extraction from pre-specified points. This allows us to find all occ approximate
matches of a pattern of length m in O(z min(mk + m, k^4 + m) + occ) time. Our space bound (in terms
of z) and substring extraction time are the same as Bille et al.'s [2]; our approximate pattern matching
time is faster both because we replace r by z (which, as noted above, they can too) and because we
remove the log n term, which is due to the overhead for random access. More importantly, however, our
results require much simpler machinery. We believe the block graph is the first practical data structure 
with solid theoretical guarantees for compression and retrieval of highly repetitive collections. 

In the next section we describe the block graph. Then, in Section 3 we relate the size of the block
graph to the size of the LZ77 parsing of its underlying string. We show that a block graph naturally
compresses the string while allowing efficient random access and extraction of substrings. In Section 4
we show how to augment the block graph to support fast approximate pattern matching. In Sections 5
and 6 we describe a practical implementation of the block graph and compare its performance to that
of Kreft and Navarro's data structure.

^1 In this paper we consider only the version of LZ77 without self-referencing, sometimes called LZSS [21].

[Figure 1: image not reproduced; the surviving node labels are abaa, aaba and baba.]

Fig. 1 The block graph for the eighth Fibonacci string, abaababaabaababaababa, truncated at depth 3.

We note that the idea of searching only around phrase boundaries in the LZ77 parse could be useful
in other contexts. For example, suppose we want to build an index for approximate pattern matching in
a text and we know in advance reasonable upper bounds M and K on the lengths of the patterns and
the edit distances in which we will be interested. We can extract the M + K characters to either side
of each boundary, obtaining substrings of length 2(M + K); separate each pair of consecutive substrings
by K + 1 copies of a character not in the alphabet; and build an index for the resulting modified string,
which could be much smaller. For any pattern of length at most M and any edit distance at most K, the
original string contains an approximate match if and only if the modified string does; moreover, from
the positions of the approximate matches in the modified string and the structure of the LZ77 parse, we
can use two-sided range reporting to deduce the positions of the approximate matches in the original
string [S]. We hope to use similar ideas to reduce the space usage of hash-based indexes [52].
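The construction of the modified string can be sketched as follows. This is only an illustration under stated assumptions: the function name modified_string, the separator "#" and the hard-coded phrase boundaries (0-based phrase starts of a greedy LZ77 parse without self-references) are ours, not part of the original text.

```python
def modified_string(s, boundaries, M, K, sep="#"):
    """Concatenate the windows of radius M + K around each phrase boundary,
    separated by K + 1 copies of a character that is not in the alphabet."""
    r = M + K
    windows = [s[max(0, b - r): b + r] for b in boundaries]
    return (sep * (K + 1)).join(windows)


s = "abaababaabaababaababa"
# Assumed phrase starts for the parse a|b|a|aba|baaba|ababaaba|ba.
boundaries = [0, 1, 2, 3, 6, 11, 19]
t = modified_string(s, boundaries, M=3, K=1)
```

On such a short string the modified string is longer than the original; the saving is only expected when z is much smaller than n/(M + K).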

2 Block graphs 

For the moment, assume n = 2^h for some integer h. We start building the block graph of s with node
(1..n), which we call the root and consider to be at depth 0. For 0 ≤ d < h, for each node v = (i..i + b - 1)
at depth d, where b = 2^(h-d) is the block size at depth d, we add pointers from v to nodes (i..i + b/2 - 1),
(i + b/4..i + 3b/4 - 1) and (i + b/2..i + b - 1), creating those nodes if necessary. We call these three nodes
the children of v and each other's siblings, and we call v their parent. Notice that a node can have two
parents. We associate with each node (i..j) the block s[i..j] of characters in s. If n is not a power of 2,
then we append blanks to s until it is. After building the block graph, we remove any nodes whose blocks
contain only blanks or blanks and characters in another block at the same depth, and replace any node
(i..j) with j > n by (i..n). We delete all pointers to any removed nodes.
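As a small illustration of the three overlapping children and of the pruning at the end of the string, consider the sketch below. The helper names children and prune are hypothetical, and prune encodes our reading of the removal rule (a node is dropped when its non-blank characters are already covered by its left neighbour at the same depth).

```python
def children(i, b):
    """Children of node (i..i+b-1), 1-based inclusive, for block size b >= 4."""
    return [(i, i + b // 2 - 1),
            (i + b // 4, i + 3 * b // 4 - 1),
            (i + b // 2, i + b - 1)]


def prune(node, n):
    """Return None for a removed node, otherwise the node clipped to (i..n)."""
    i, j = node
    b = j - i + 1
    only_blanks = i > n
    # The left neighbour at the same depth is (i - b/2 .. i + b/2 - 1).
    covered_by_left = i - b // 2 >= 1 and i + b // 2 - 1 >= n
    if only_blanks or covered_by_left:
        return None
    return (i, min(j, n))
```

For the Fibonacci string of length 21 padded to 32, this reproduces the nodes named in the text: the root keeps (1..16) and (9..21), and (17..32) and (21..24) disappear.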

We can reduce the size of the block graph by truncating it such that we keep only the nodes at depths 
where storing three pointers takes less space than storing a block of characters explicitly. We mark as 
an internal node each node whose block is the first occurrence of that substring in s. At the deepest 
internal nodes, instead of storing a pointer, we store the nodes' blocks explicitly. We mark as a leaf all 
nodes whose block is not unique and whose parents are internal nodes. We then remove any node that is 
not marked as an internal node or a leaf. Figure 1 shows the block graph for the eighth Fibonacci string,
abaababaabaababaababa, truncated at depth 3. Oval nodes are internal nodes and rectangular nodes are
leaves. Notice that the root has only two children, because the block for node (17..32) would contain
only blanks and characters in s[9..21], so (17..32) is removed; similarly, (21..24) is removed.

The key phase in building the block graph is updating the leaves' pointers, shown in Figure 1 as
the arrows below rectangular nodes. Suppose a leaf u at depth d had a child (i..j), which has been
removed because it was neither an internal node nor a leaf. Consider the first occurrence s[i'..j'] in s of
the substring s[i..j]. Notice that s[i'..j'] is completely contained within some block at depth d (this is
one reason why we use overlapping blocks) and, since s[i'..j'] is the first occurrence of that substring
in s, that block is associated with an internal node v. We replace the pointer from u to (i..j) by a pointer
to v and the offset of i' in v's block. For the example shown in Figure 1, (17..21) previously had children
(17..20) and (19..21). The blocks s[17..20] = abab and s[19..21] = aba first occur in positions 4
and 1, respectively. Therefore, we replace (17..21)'s pointer to (17..20) by a pointer to (1..8) and the
offset 3; we replace its pointer to (19..21) by another pointer to (1..8) and the offset 0.
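The pointer update can be checked on the running example with a few lines of code. This is a sketch, not the authors' construction algorithm: leaf_pointer is a hypothetical helper, it finds first occurrences with a naive search, and when two blocks at the same depth contain the first occurrence it always picks the leftmost one.

```python
def leaf_pointer(s, i, j, b):
    """Pointer that replaces the removed child (i..j) of a leaf at a depth with
    block size b. Positions are 1-based; returns ((block start, block end), offset)."""
    n = len(s)
    p = s.find(s[i - 1:j]) + 1             # first occurrence of s[i..j], 1-based
    half = b // 2                          # blocks at this depth start every b/2 characters
    start = ((p - 1) // half) * half + 1   # leftmost block that contains the occurrence
    return (start, min(start + b - 1, n)), p - start


s = "abaababaabaababaababa"
print(leaf_pointer(s, 17, 20, 8))   # child (17..20) of leaf (17..21)
print(leaf_pointer(s, 19, 21, 8))   # child (19..21) of leaf (17..21)
```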

Extracting a single character s[i] in O(log n) time is fairly straightforward: we start at the root and
repeatedly descend to any child whose block contains s[i]; if we come to a leaf u whose child containing
s[i] has been removed, so that s[i] would have been the jth character in that child's block and u instead
stores a pointer to internal node v and offset c, then we follow u's pointer to v and extract the (j + c)th
character in v's block; finally, when we arrive at an internal node with maximum depth, we report the
appropriate character of its block, which is stored there explicitly. By definition the maximum depth of
the block graph is log n and at each depth, we either descend immediately in O(1) time, or follow a pointer
from a leaf to an internal node in O(1) time and then descend. Therefore, we use a total of O(log n) time.

For example, suppose we want to extract the 11th character from s = abaababaabaababaababa using
the block graph shown in Figure 1. Starting at the root, we can descend to either child, since both their
blocks contain s[11]; suppose we descend to the left child, (1..16). From (1..16) we can descend to either
the middle or right children; suppose we descend to the right child, (9..16). Since (9..16) is a leaf, the
pointer to child (9..12) has been replaced by a pointer to (1..8) and offset 0, while the pointer to child
(11..14) has been replaced by another pointer to (1..8) and offset 2. This is because the first occurrence
of s[9..12] = abaa is s[1..4] and the first occurrence of s[11..14] = aaba is s[3..6]. Suppose we follow the
second pointer. Since we would have extracted the first character from (11..14)'s block, we are now to
extract the third character from (1..8)'s block. We can descend to either (1..4) and extract the third
character of its block, or descend to (3..6) and extract the first character of its block.
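Putting the pieces of this section together, single-character access can be sketched as below. Several simplifications are ours and not part of the original description: nodes are stored in a dictionary keyed by (depth, 0-based start), first occurrences are found with a naive search, blocks are truncated at 4 characters, and redundant nodes near the end of the string are kept rather than removed.

```python
def build_block_graph(s):
    n = len(s)
    h = max(2, (n - 1).bit_length())   # padded length is 2**h
    dmax = h - 2                       # depth at which blocks have 4 characters
    graph, level = {}, {0}
    for d in range(dmax + 1):
        b = 1 << (h - d)               # block size at depth d
        nxt = set()
        for st in level:
            block = s[st:st + b]       # clipped at the end of s
            if s.find(block) == st:    # first occurrence: internal node
                if d == dmax:
                    graph[d, st] = ("text", block)
                else:
                    graph[d, st] = ("internal",)
                    nxt.update(c for c in (st, st + b // 4, st + b // 2) if c < n)
            else:                      # repeated block: leaf with three pointers
                ptrs = []
                for cs in (st, st + b // 4, st + b // 2):
                    if cs >= n:
                        ptrs.append(None)
                        continue
                    p = s.find(s[cs:cs + b // 2])   # first occurrence of the child block
                    vs = p - p % (b // 2)           # depth-d block that contains it
                    ptrs.append((vs, p - vs))
                graph[d, st] = ("leaf", ptrs)
        level = nxt
    return graph, h


def access(graph, h, q):
    """Return s[q] (0-based) using only the graph, never the original string."""
    d, st, o = 0, 0, q                 # o is the offset of s[q] in the current block
    while True:
        kind, *data = graph[d, st]
        if kind == "text":
            return data[0][o]
        quarter = (1 << (h - d)) // 4
        c = min(2, o // quarter)       # index of a child interval containing offset o
        if kind == "internal":
            d, st, o = d + 1, st + c * quarter, o - c * quarter
        else:                          # leaf: jump to an internal node at the same depth
            vs, off = data[0][c]
            st, o = vs, o - c * quarter + off


s = "abaababaabaababaababa"
graph, h = build_block_graph(s)
print(access(graph, h, 10))            # the 11th character of s
```

Each step either descends one level or follows one leaf pointer and then descends, which mirrors the O(log n) argument above.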

Extracting longer substrings is similar, but complicated by the fact that we want to avoid breaking 
the substring into too many pieces as we descend. In the next section we will show how to extract any 
substring of length m in O(logn + m) time; however, we first prove an upper bound on the block graph's 
size. 



3 Fast random access in compressed space 

In this section we show that block graphs achieve compression while simultaneously allowing easy access 
to the underlying string. Our space result relies on the following easily proved lemma. 

Lemma 1 ([7]) The first occurrence of any substring in s must touch at least one boundary between
phrases in the LZ77 parse. 
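Lemma 1 is easy to check by brute force on small inputs. The sketch below assumes a greedy parse without self-references in which each phrase is either a new character or the longest prefix of the remaining text that occurs entirely in the text parsed so far; it also assumes that "touching" a boundary means the substring starts at, ends at, or spans it. Both function names are ours.

```python
def lz77_parse(s):
    """Return the phrases as (start, length) pairs, 0-based."""
    phrases, i = [], 0
    while i < len(s):
        l = 0
        # Extend while the candidate occurs entirely inside s[:i].
        while i + l < len(s) and s.find(s[i:i + l + 1], 0, i) != -1:
            l += 1
        l = max(l, 1)                  # a new character forms its own phrase
        phrases.append((i, l))
        i += l
    return phrases


def lemma_holds(s):
    """Check that every first occurrence s[a:b] touches a phrase boundary."""
    bounds = [st for st, _ in lz77_parse(s)] + [len(s)]
    for a in range(len(s)):
        for b in range(a + 1, len(s) + 1):
            if s.find(s[a:b]) == a and not any(a <= x <= b for x in bounds):
                return False
    return True


s = "abaababaabaababaababa"
print(lz77_parse(s))
```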

Lemma 1 allows us to relate the size of the block graph to the LZ77 parsing of the underlying string,
as summarized below.

Theorem 1 The block graph for s takes O(z log^2 n) bits.

Proof Each internal node's block is the first occurrence of that substring in s so, by Lemma 1, it
must touch at least one boundary between phrases in the LZ77 parse. Since each such boundary can
touch at most three blocks in the same level, there are at most 3z internal nodes in each level. It follows
that there are O(z log n) nodes in all. Since each node stores O(log n) bits, the whole block graph takes
O(z log^2 n) bits. □

We define the query extract(u, i, j) to return the ith through jth characters in u's block. Notice that,
if u is the root, then these characters are s[i..j]. We now show how to implement extract queries in such
a way that extracting a substring of s with length m takes O(log n + m) time.

There are three cases to consider when performing extract(u, i, j): u could be an internal node at
maximum depth, in which case we simply return the ith through jth characters of its block, which are
stored explicitly; u could be an internal node with children; or u could be a leaf. First suppose that u is
an internal node with children. Let d be u's depth and b = 2^(⌈log n⌉ - d); notice b is the length of u's block
unless the block is a suffix of s, in which case the block might be shorter. If the interval [i..j] is completely
contained in one of the intervals [1..b/2], [b/4 + 1..3b/4] or [b/2 + 1..b], then we set v to be the left, middle
or right child of u, respectively (choosing arbitrarily if two intervals each completely contain [i..j]), and
implement extract(u, i, j) as either extract(v, i, j), extract(v, i - b/4, j - b/4) or extract(v, i - b/2, j - b/2).
Otherwise, [i..j] must be more than a quarter of [1..b] and we can split [i..j] into 2 or 3 subintervals, each
of length at least b/8 but completely contained in one of [1..b/2], [b/4 + 1..3b/4] or [b/2 + 1..b]; this is
the other reason why we use overlapping blocks. We implement extract(u, i, j) with an extract query for
each subinterval.

Now suppose that u is a leaf. Again, let d be u's depth and b = 2^(⌈log n⌉ - d). If the interval [i..j] is
completely contained in one of the intervals [1..b/2], [b/4 + 1..3b/4] or [b/2 + 1..b], then we set v to be
the first, second or third internal node at the same depth to which u points, respectively, and implement
extract(u, i, j) as extract(v, i', j'), where i' and j' are i and j plus the appropriate offset. Otherwise, [i..j]
must be more than a quarter of [1..b]; we split [i..j] into subintervals and implement extract(u, i, j) with
an extract query for each subinterval, as before.

Theorem 2 Extracting a substring s[f..ℓ] from the block graph of s takes O(log n + ℓ - f) time.

Proof Consider the query extract(root, f, ℓ) and let d be the first depth at which we split the interval.
Descending to depth d takes a total of O(d) time. By induction, if we perform a query extract(v, i, j) on
a node v at depth d' > d, then j - i + 1 is more than a quarter of the block size 2^(⌈log n⌉ - d') at that level.
It follows that we make O((ℓ - f + 1)/2^(⌈log n⌉ - d')) calls to extract at depth d', each of which takes O(1)
time. Summing over the depths, we use a total of O(log n + ℓ - f) time. □

One interesting property of our block graph structure is that, at the cost of storing a node for every
possible block of size n/2^d (i.e., storing O(2^d log n) extra bits), we can remove the top d levels and,
thus, change the overall space bound to O(z(log n - d) log n + 2^d log n) bits and reduce the access time
to O(log n - d). For example, if d = log z, then we store a total of O(z log n log(n/z)) bits and need only
O(log(n/z)) time for access. If d = log(n/log^2 n), then we store a total of O(z log n log log n + n/log n)
bits and reduce the access time to O(log log n).
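The two instantiations follow by substituting d into the general bound. The short derivation below spells out the arithmetic; the side condition z ≤ n/2 in the first case is our assumption, needed so that the z log n term is absorbed.

```latex
\begin{align*}
\text{space}(d) &= O\bigl(z(\log n - d)\log n + 2^{d}\log n\bigr) \text{ bits},
  \qquad \text{time}(d) = O(\log n - d).\\[4pt]
d = \log z:\quad
  & z(\log n - \log z)\log n + z\log n
    = z\log n\,\bigl(\log(n/z) + 1\bigr)
    = O\bigl(z\log n\,\log(n/z)\bigr) \quad (z \le n/2),\\
  & \log n - d = \log(n/z).\\[4pt]
d = \log\bigl(n/\log^{2} n\bigr):\quad
  & \log n - d = 2\log\log n,
    \qquad 2^{d}\log n = \frac{n}{\log^{2} n}\,\log n = \frac{n}{\log n},\\
  & \text{space} = O\bigl(z\log n\,\log\log n + n/\log n\bigr),
    \qquad \text{time} = O(\log\log n).
\end{align*}
```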

Gonzalez and Navarro TT showed how, by applying grammar-based compression to a difference-coded
suffix array (SA), we can build a new kind of compressed suffix array that supports access to SA[i..j] in
O(log n + j - i) time. It seems likely that, by using a modified block graph of the difference-coded suffix
array instead of a grammar, we can improve their access time to O(log log n + j - i) at the cost of only
slightly increasing their space bound.

4 Accelerated approximate pattern matching 

Suppose we are given an uncompressed string s of length n, the LZ77 parse [33] of s, a pattern p of
length m < n and an edit distance k < m. The primary matches of p are the substrings of s within edit
distance k of p whose characters are all within distance (m + k) of phrase boundaries in the parse. It is
not difficult to find all p's primary matches in O(z min(mk + m, k^4 + m)) time, where z is the number
of phrases. To do this, we extract the substrings all of whose characters are within distance (m + k) of
phrase boundaries and apply to them either the sequential approximate pattern-matching algorithm by
Landau and Vishkin [T3] or the one by Cole and Hariharan [3].
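The search around phrase boundaries can be sketched as follows. To keep the example short we swap in the textbook dynamic-programming matcher (often attributed to Sellers), which takes O(m) time per text character, in place of the faster Landau-Vishkin or Cole-Hariharan algorithms named above; the function names and the hard-coded boundaries are assumptions for illustration. Matches are reported by their end positions only.

```python
def sellers(text, p, k):
    """End positions e (so that a match is text[a:e] for some a) at which some
    substring of text is within edit distance k of p."""
    m = len(p)
    col = list(range(m + 1))           # distances against the empty text prefix
    ends = []
    for e, c in enumerate(text, 1):
        new = [0]                      # a match may start at any text position
        for i in range(1, m + 1):
            new.append(min(col[i - 1] + (p[i - 1] != c),   # match or substitute
                           col[i] + 1,                     # extra text character
                           new[i - 1] + 1))                # missing pattern character
        col = new
        if col[m] <= k:
            ends.append(e)
    return ends


def primary_match_ends(s, boundaries, p, k):
    """Search only the windows of radius m + k around the phrase boundaries."""
    r = len(p) + k
    ends = set()
    for b in boundaries:
        lo = max(0, b - r)
        ends.update(lo + e for e in sellers(s[lo:b + r], p, k))
    return sorted(ends)


s = "abaababaabaababaababa"
boundaries = [0, 1, 2, 3, 6, 11, 19]   # assumed phrase starts of the LZ77 parse
print(primary_match_ends(s, boundaries, "abab", 1))
```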

Once we have found p's primary matches, we can use them to find the approximate matches not
within distance (m + k) of any phrase boundary, which are called p's secondary matches. To do this,
we process the phrases from left to right, maintaining a sorted list of the approximate matches we have
already found. For each phrase copied from a previous substring s[i..j], we search in the list to see if
there are any approximate matches in s[i..j] that are not completely contained in s[i..i + m + k - 1] or
s[j - m - k + 1..j]. If there are, we insert the corresponding secondary matches in our list. Processing
all the phrases takes O(z + occ) time, where occ is the number of approximate matches to p in s. Notice
that finding p's secondary matches does not require access to s.
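The left-to-right pass over the phrases can be sketched as below. The representation is our assumption: a phrase is a triple (start, length, source) with source None for a new character, and a match is a half-open interval (a, b) of 0-based positions. The sketch uses a plain sorted Python list, so it illustrates the logic rather than the stated O(z + occ) bound.

```python
import bisect


def add_secondary(phrases, primary, m, k):
    """Return all matches, given the primary ones and the LZ77 phrases."""
    r = m + k
    found = sorted(set(primary))
    for st, length, src in phrases:
        if src is None:
            continue                               # a new character copies nothing
        lo = bisect.bisect_left(found, (src, 0))
        hi = bisect.bisect_left(found, (src + length, 0))
        shift = st - src
        new = []
        for a, b in found[lo:hi]:                  # matches starting inside the source
            if b > src + length:
                continue                           # sticks out of the source
            if b <= src + r or a >= src + length - r:
                continue                           # its copy lies near a boundary: primary
            new.append((a + shift, b + shift))
        for mt in new:
            pos = bisect.bisect_left(found, mt)
            if pos == len(found) or found[pos] != mt:
                found.insert(pos, mt)
    return found


s = "abaababaabaababaababa"
# Assumed parse a|b|a|aba|baaba|ababaaba|ba with the source of each copied phrase.
phrases = [(0, 1, None), (1, 1, None), (2, 1, 0), (3, 3, 0),
           (6, 5, 1), (11, 8, 3), (19, 2, 1)]
```

Note that the function never reads s, in line with the remark above.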

As noted in Section 1, Bille et al. [5] showed how, given a straight-line program for s with r rules, we
can build an O(r)-word data structure that allows us to extract any substring s[f..ℓ] in O(log n + ℓ - f)
time. When the straight-line program is built with the best known algorithm for approximately minimizing
the number of rules, r = O(z log n) [TH]. It follows that we can store s in O(z log n) words such that,
given p and k, in O(z(log n + m)) time we can extract all the characters within distance (m + k) of phrase
boundaries and, therefore, find all p's approximate matches in O(z(min(mk + m, k^4 + m) + log n) + occ)
time. (Bille et al. themselves gave a bound of O(r(min(mk + m, k^4 + m) + log n) + occ) but, since even
the smallest straight-line program for s has at least z rules [19], the one we state is slightly stronger.)

The key to supporting approximate pattern matching in the block graph is the addition of bookmarks,
which will allow us to quickly extract certain regions of the underlying string. To add a bookmark to a
character s[i], for each block size b in the block graph, we store pointers to the two nodes whose blocks
of size 2b completely contain the first occurrences of the substrings s[i - b + 1..i] and s[i..i + b - 1], and
those occurrences' offsets in the blocks. Thus, storing a bookmark takes O(log n) words. To extract a
substring that touches s[i], we extract, separately, the parts of the substring to the left and right of s[i].
Without loss of generality, we assume the part s[i..j] to the right is longer and consider only how to
extract it. We first find the smallest block size b ≥ j - i + 1, then follow the pointer to the node whose
block of size 2b contains the first occurrence of s[i..i + b - 1]. Since that node has height O(log(j - i + 1)),
we can extract s[i..j] in O(j - i + 1) time.
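What a bookmark stores can be sketched as follows. The function name make_bookmark and the representation of a node by the 0-based start of its block are assumptions; first occurrences are found naively, and substrings are clipped at the ends of s.

```python
def make_bookmark(s, i):
    """For each block size b, return where to find s[i-b+1..i] and s[i..i+b-1]:
    a pair of (start of a block of size 2b, offset inside that block) entries."""
    n, marks, b = len(s), {}, 1
    while b < 2 * n:
        entry = []
        for lo, hi in ((max(0, i - b + 1), i + 1), (i, min(n, i + b))):
            p = s.find(s[lo:hi])       # first occurrence of the substring
            start = (p // b) * b       # blocks of size 2b start every b characters
            entry.append((start, p - start))
        marks[b] = tuple(entry)
        b *= 2
    return marks


s = "abaababaabaababaababa"
bookmark = make_bookmark(s, 11)        # a bookmark at the phrase boundary 11
```

The dictionary has one entry per block size, which matches the O(log n) words per bookmark stated above.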

Lemma 2 Extracting a substring s[f..ℓ] that touches a bookmark takes O(ℓ - f) time.

Inserting a bookmark at each phrase boundary in the LZ77 parse takes O(z log n) words and allows
us, given m and k, to extract the characters within distance (m + k) of phrase boundaries in a total of
O(zm) time. Combined with the approach described above for finding secondary occurrences, we have
our main result.

Theorem 3 Let s be a string of length n whose LZ77 parse consists of z phrases. We can store s in 
O{zlogn) words such that, given a pattern p of length m < n and an edit distance k < m, we can find 
all occ substrings of s within edit distance k of p in O(z min(mk + m, k^4 + m) + occ) time.

Note that, in the above theorem, the time to find all p's approximate matches is the same as if we 
were keeping s uncompressed, as in the approach described at the start of this section. 

We note in passing that we can combine our results with those of Kreft and Navarro [13] to obtain
a new worst-case upper bound for LZ77-based indexing. Specifically, replacing their data structures for
access to the string by a block graph with a bookmark at each phrase boundary, and replacing two of
their other data structures by faster (and larger, but still O(z log^2 n) bits) data structures, we can store
s in O(z log^2 n) bits such that, given a pattern p of length m, we can find all occurrences of p in s in
O(m^2 + (m + occ) log log z) time. Their index is practical but potentially larger and slower in the worst
case.

5 Efficient representation of block graphs 

We now describe an implementation of block graphs which is efficient in practice. The main idea is to 
represent the shape of the graph (the internal nodes and their pointers) using bitvectors and operations 
from succinct data structures, and to carefully allocate space for the leaf nodes depending on their 
distance from the root. Below we make use of two familiar operations for bitvectors: rank and select. 
Given a bitvector B, a position i, and a type of bit b (either 0 or 1), rank_b(B, i) returns the number of
occurrences of b before position i in B and select_b(B, i) returns the position of the ith b in B. Efficient
data structures supporting these operations have been extensively studied (see, e.g., [TT t fTS ]).

Each level of the block graph consists of a number of nodes, either internal nodes or leaves. Let B_d
be a bitvector which says whether the ith node (from the left) at depth d is a leaf, B_d[i] = 0, or an
internal node, B_d[i] = 1. We define another bitvector R_d, where R_d[i] = 1 if and only if B_d[i] = 1 and
B_d[i + 1] = 1 for i < n - 1. That is, we mark a 1 bit for each instance of two adjacent internal nodes
in B_d; otherwise R_d[i] = 0. Let L_d be an array that holds the leaf nodes at depth d. The structure of a
leaf node is discussed below. Finally, let T be the concatenation of the textual representation (i.e., the
corresponding substrings) of all internal nodes at the truncated depth.

Navigating the block graph. The main operation is to traverse from an internal node to one of its three
children. Say we are currently at the jth internal node at depth d of the block graph; that is, we are at
B_d[i], where i = select_1(B_d, j). Each internal node has three children. If these children were independent
then locating the left child of the current node would be simply three times the node's position on its
level, that is 3j = 3 · rank_1(B_d, i). However, in a block graph adjacent internal nodes share exactly one
child, so we correct for this by subtracting the number of adjacent internal nodes at this depth prior to
the current node; this is given by rank_1(R_d, i). To find the position corresponding to the left child of
a node in B_{d+1} we compute

leftchild(B_d, i) = 3 rank_1(B_d, i) - rank_1(R_d, i)

Given the address of the left child it is easy to find the center or right child by adding 1 or 2,
respectively, to the result of leftchild. If B_d[i] = 0 then we are at a leaf node. Intuitively, to access its leaf
information we look up L_d[rank_0(B_d, i)]. Once we reach the truncated depth, to access the text of an
internal node we compute its offset in T, T[rank_1(B_d, i) · truncated length].
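The navigation formula can be tried out with naive, linear-time rank and select. This is a sketch: positions are 0-based, rank counts bits strictly before position i, and the example bitvector is our own reading of depth 2 of the Fibonacci example (nodes (1..8), (5..12), (9..16), (13..20), (17..21), classified by checking first occurrences), since Figure 1 itself is not reproduced. It also assumes that adjacent 1 bits correspond to internal nodes that really share a child.

```python
def rank(bits, b, i):
    """Number of occurrences of bit b strictly before position i."""
    return sum(1 for x in bits[:i] if x == b)


def select(bits, b, j):
    """Position of the j-th occurrence (1-based) of bit b."""
    seen = 0
    for pos, x in enumerate(bits):
        if x == b:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences")


def adjacency(B):
    """R[i] = 1 iff nodes i and i + 1 are both internal."""
    return [int(B[i] == 1 and i + 1 < len(B) and B[i + 1] == 1) for i in range(len(B))]


def left_child(B, R, i):
    """Position in the next level of the left child of internal node i."""
    return 3 * rank(B, 1, i) - rank(R, 1, i)


B2 = [1, 1, 0, 1, 0]                   # internal, internal, leaf, internal, leaf
R2 = adjacency(B2)
print([left_child(B2, R2, i) for i in (0, 1, 3)])
```

At the next depth the stored nodes are (1..4), (3..6), (5..8), (7..10), (9..12), (13..16), (15..18), (17..20), so the left children of the three internal nodes sit at positions 0, 2 and 5.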

Leaf nodes. In a block graph leaves point to internal nodes. For each leaf we store two values, the 
position of the destination node on the current level, and an offset in the destination node pointing to 
the beginning of the leaf block. Note that we do not need to store the depth of the destination node. It 
is, by definition, on the level above the leaf, and we know this by keeping track of the depth during
each step in a traversal. To improve compression we store leaf positions and offsets in two separate arrays. 
At depth d there are no more than 2^(d+1) - 1 possible nodes, so we can store each position in log(2^(d+1) - 1)
bits. Given that the length of a node at depth d is b = 2^(⌈log n⌉ - d) and leaf nodes point to an internal node
on the level above, we store each offset in log(2^(⌈log n⌉ - d - 1)) bits.



6 Experiments 

We have developed an implementation of block graphs^2 and tested it on the real-world texts of the
Pizza-Chili Repetitive Corpus^3, a standard testbed for data structures designed for repetitive strings.

We compared compression achieved by the block graph to the LZ-End data structure by Kreft and
Navarro T^, and to the general-purpose compressors gzip and 7zip; the results are shown in Table 1.
We used gzip and 7zip with the settings -9 and -t7z -m0=lzma -mx=9 -mfb=64 -md=32m -ms=on,
respectively, while LZ-End was executed with its default settings. Throughout our experiments all block
graphs were truncated such that the smallest blocks each took 4 bytes. Note that gzip and 7zip provide
compression only, not random access, and are included as reference points for achievable compression.

We then compared how quickly block graphs and LZ-End support extracting substrings of various
lengths; the results are shown in Figure 2. Each run of extractions was performed across 10,000
randomly-generated queries. Experiments were conducted on an Intel Core i7-2600 3.4 GHz processor
with 8GB of main memory, running Linux 3.3.4; code was compiled with GCC version 4.7.0 targeting
x86_64 with full optimizations. Caches were dropped between runs with sync && echo 1 >
/proc/sys/vm/drop_caches.

Although 7zip achieves much better compression, block graphs achieve better compression than gzip
except on the Escherichia Coli and influenza files. Most importantly, our experiments show that block
graphs generally achieve compression comparable to that achieved by LZ-End while supporting
significantly faster substring extraction.



7 Conclusions 

Efficient storage and retrieval of highly repetitive strings, and approximate pattern matching in them, 
are important tools in bioinformatics and will become even more important as genomic databases grow. 
In this paper we have presented a new data structure, the block graph, that stores highly repetitive
strings in compressed space, supports random access in reasonable time and supports extraction from 
pre-specified points much faster. Our analysis and experiments show that the block graph is competitive 
both in theory and in practice. 

^2 Available at http://www.github.com/choobin/block-graph
^3 http://pizzachili.dcc.uchile.cl/repcorpus.html
Table 1 Size in bytes of repetitive corpus files encoded with ASCII, gzip, 7zip, LZ-End and block graphs.

Collection              ASCII         gzip       7zip      LZ-End  Block graph
Escherichia Coli  112,689,515   31,535,023  6,147,962  49,106,638   49,716,456
cere              461,286,644  120,834,282  6,077,972  41,342,784   57,689,376
coreutils         205,281,778   49,920,838  3,999,812  35,863,520   47,795,692
einstein.en.txt   467,626,544  163,664,285    323,779   2,247,204    3,969,392
influenza         154,808,555   10,636,899  2,111,974  21,507,089   33,171,036
kernel            257,961,616   69,396,104  2,087,006  19,347,734   24,045,332
para              429,265,758  116,073,220  8,117,573  57,415,176   72,393,196
world leaders      46,968,181    8,287,665    606,438   4,525,317    7,321,720